100 research outputs found
Recommended from our members
Using attribution to decode binding mechanism in neural network models for chemistry
Deep neural networks have achieved state-of-the-art accuracy at
classifying molecules with respect to whether they bind to specific
protein targets. A key breakthrough would occur if these models
could reveal the fragment pharmacophores that are causally
involved in binding. Extracting chemical details of binding from
the networks could enable scientific discoveries about the mechanisms
of drug actions. However, doing so requires shining light
into the black box that is the trained neural network model, a
task that has proved difficult across many domains. Here we show
how the binding mechanism learned by deep neural network
models can be interrogated, using a recently described attribution
method. We first work with carefully constructed synthetic
datasets, in which the molecular features responsible for “binding”
are fully known. We find that networks that achieve perfect
accuracy on held-out test datasets still learn spurious correlations,
and we are able to exploit this nonrobustness to construct adversarial
examples that fool the model. This makes these models
unreliable for accurately revealing information about the mechanisms
of protein–ligand binding. In light of our findings, we
prescribe a test that checks whether a hypothesized mechanism
can be learned. If the test fails, it indicates that the model must be
simplified or regularized and/or that the training dataset requires
augmentation.
M.P.B. gratefully acknowledges support from the National Science Foundation through NSF-DMS1715477, as well as support from the Simons Foundation. L.J.C. gratefully acknowledges a Next Generation fellowship, a Marie Curie Career Integration Grant (Evo-Couplings, 631609), and support from the Simons Foundation. F.M. performed work during an internship at Google
Protein sectors: statistical coupling analysis versus conservation
Statistical coupling analysis (SCA) is a method for analyzing multiple
sequence alignments that was used to identify groups of coevolving residues
termed "sectors". The method applies spectral analysis to a matrix obtained by
combining correlation information with sequence conservation. It has been
asserted that the protein sectors identified by SCA are functionally
significant, with different sectors controlling different biochemical
properties of the protein. Here we reconsider the available experimental data
and note that it involves almost exclusively proteins with a single sector. We
show that in this case sequence conservation is the dominating factor in SCA,
and can alone be used to make statistically equivalent functional predictions.
Therefore, we suggest shifting the experimental focus to proteins for which SCA
identifies several sectors. Correlations in protein alignments, which have been
shown to be informative in a number of independent studies, would then be less
dominated by sequence conservation.
Comment: 36 pages, 17 figures
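The abstract above describes SCA as spectral analysis of a matrix that combines correlation information with sequence conservation. The following is a minimal sketch of that general idea, not the published SCA implementation: the function names, the consensus-residue binarization, the background frequency of 0.05, and the choice of relative entropy as the conservation weight are all assumptions made for illustration.

```python
import numpy as np

def consensus_binarize(alignment):
    """X[s, i] = 1 if sequence s carries the most common residue at column i."""
    arr = np.array([list(seq) for seq in alignment])
    X = np.zeros(arr.shape, dtype=float)
    for i in range(arr.shape[1]):
        residues, counts = np.unique(arr[:, i], return_counts=True)
        X[:, i] = arr[:, i] == residues[np.argmax(counts)]
    return X

def conservation(freq, background=0.05, eps=1e-9):
    """Relative entropy of a binary frequency against an assumed background."""
    f = np.clip(freq, eps, 1 - eps)
    q = background
    return f * np.log(f / q) + (1 - f) * np.log((1 - f) / (1 - q))

def weighted_correlation_spectrum(alignment):
    """Conservation-weighted covariance matrix and its eigendecomposition."""
    X = consensus_binarize(alignment)
    f = X.mean(axis=0)                       # consensus frequency per column
    C = np.cov(X, rowvar=False, bias=True)   # column-column covariance
    w = conservation(f)                      # hypothetical conservation weights
    M = np.abs(np.outer(w, w) * C)           # combine correlation and conservation
    eigvals, eigvecs = np.linalg.eigh(M)     # M is symmetric
    return f, w, M, eigvals[::-1], eigvecs[:, ::-1]  # largest eigenvalue first

if __name__ == "__main__":
    toy = ["ACD", "ACE", "AGD", "TCD"]       # toy alignment, illustration only
    f, w, M, eigvals, _ = weighted_correlation_spectrum(toy)
    print("consensus frequencies:", f)
    print("eigenvalues:", eigvals)
```

In a sketch like this, groups of positions with large components in the leading eigenvectors would be the candidates for "sectors". Because the weights depend only on conservation, comparing such groups against a ranking by `w` alone is one way to probe how much of the signal conservation carries by itself.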
Statistical and machine learning approaches to predicting protein-ligand interactions.
Data-driven computational approaches to predicting protein-ligand binding are currently achieving unprecedented levels of accuracy on held-out test datasets. Up until now, however, this has not led to corresponding breakthroughs in our ability to design novel ligands for protein targets of interest. This review summarizes the current state of the art in this field, emphasizing the recent development of deep neural networks for predicting protein-ligand binding. We explain the major technical challenges that have caused difficulty with predicting novel ligands, including the problems of sampling noise and the challenge of using benchmark datasets that are sufficiently unbiased that they allow the model to extrapolate to new regimes.
Comparative analysis of nanobody sequence and structure data.
Nanobodies are a class of antigen-binding protein derived from camelids that achieve comparable binding affinities and specificities to classical antibodies, despite comprising only a single 15 kDa variable domain. Their reduced size makes them an exciting target molecule with which we can explore the molecular code that underpins binding specificity: how is such high specificity achieved? Here, we use a novel dataset of 90 nonredundant, protein-binding nanobodies with antigen-bound crystal structures to address this question. To provide a baseline for comparison we construct an analogous set of classical antibodies, allowing us to probe how nanobodies achieve high-specificity binding with a dramatically reduced sequence space. Our analysis reveals that nanobodies do not diversify their framework region to compensate for the loss of the VL domain. In addition to the previously reported increase in H3 loop length, we find that nanobodies create diversity by drawing their paratope regions from a significantly larger set of aligned sequence positions, and by exhibiting greater structural variation in their H1 and H2 loops.
Charge as a Selection Criterion for Translocation through the Nuclear Pore Complex
Nuclear pore complexes (NPCs) are highly selective filters that control the exchange of material between nucleus and cytoplasm. The principles that govern selective filtering by NPCs are not fully understood. Previous studies find that cellular proteins capable of fast translocation through NPCs (transport receptors) are characterized by a high proportion of hydrophobic surface regions. Our analysis finds that transport receptors and their complexes are also highly negatively charged. Moreover, NPC components that constitute the permeability barrier are positively charged. We estimate that electrostatic interactions between a transport receptor and the NPC result in an energy gain of several kBT, which would enable significantly increased translocation rates of transport receptors relative to other cellular proteins. We suggest that negative charge is an essential criterion for selective passage through the NPC.
Merck Research Laboratories; National Science Foundation (U.S.) (Division of Mathematical Sciences); Kavli Institute for Bionano Science & Technology at Harvard University; National Centers for Systems Biology (U.S.) (NIGMS grant GM068763); National Institute of General Medical Sciences (U.S.
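The abstract above links an electrostatic energy gain of several kBT to increased translocation rates. As a rough illustration of why a few kBT matters, the sketch below assumes a simple Boltzmann-factor (Arrhenius-type) dependence of rate on energy. That functional form is an assumption made here, not a model taken from the abstract, and `boltzmann_rate_ratio` is a hypothetical helper name.

```python
import math

def boltzmann_rate_ratio(energy_gain_kbt):
    """Rate enhancement exp(dE / kBT) for an energy gain given in units of kBT.

    Assumes an Arrhenius-type rate law; this is an illustrative assumption.
    """
    return math.exp(energy_gain_kbt)

if __name__ == "__main__":
    for gain in (1, 2, 3, 4, 5):
        ratio = boltzmann_rate_ratio(gain)
        print(f"{gain} kBT -> rate ratio of about {ratio:.1f}")
```

Under this assumption, an energy gain of 3 kBT corresponds to a rate roughly 20 times higher, which shows how a modest electrostatic contribution could translate into a large difference in translocation.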
Conservation Weighting Functions Enable Covariance Analyses to Detect Functionally Important Amino Acids
The explosive growth in the number of protein sequences gives rise to the possibility of using the natural variation in sequences of homologous proteins to find residues that control different protein phenotypes. Because in many cases different phenotypes are each controlled by a group of residues, the mutations that separate one version of a phenotype from another will be correlated. Here we incorporate biological knowledge about protein phenotypes and their variability in the sequence alignment of interest into algorithms that detect correlated mutations, improving their ability to detect the residues that control those phenotypes. We demonstrate the power of this approach using simulations and recent experimental data. Applying these principles to the protein families encoded by Dscam and Protocadherin allows us to make testable predictions about the residues that dictate the specificity of molecular interactions.
Inferring interaction partners from protein sequences.
Specific protein-protein interactions are crucial in the cell, both to ensure the formation and stability of multiprotein complexes and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify, from sequence data alone, which proteins are specific interaction partners. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the 3D structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithm's performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from noninteracting ones, using only sequence data.
Human Frontier Science Program, National Institutes of Health (Grant ID: R01-GM082938), National Science Foundation (Grant ID: PHY-1305525), Marie Curie (Career Integration Grant ID: 631609), Next Generation Fellowship, Eric and Wendy Schmidt Transformative Technology Fund
This is the author accepted manuscript. The final version is available from the Proceedings of the National Academy of Sciences of the United States of America via https://doi.org/10.1073/pnas.160676211
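The abstract above rests on the observation that coevolution makes the sequences of interaction partners correlated. The sketch below illustrates the simplest version of such a signal, the mutual information between one alignment column from each family across paired sequences. It is a swapped-in, much simpler statistic than the pairwise maximum entropy model and iterative algorithm the abstract describes, and the function name and toy columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two equally long residue columns.

    col_a[k] and col_b[k] are the residues of the k-th paired sequences.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

if __name__ == "__main__":
    # Toy columns, illustration only: residues co-vary in the first pair
    # and are statistically independent in the second.
    print(mutual_information("AABB", "CCDD"))
    print(mutual_information("AABB", "CDCD"))
```

A correctly paired alignment should show higher correlation between coevolving positions than a shuffled pairing, which is the kind of contrast a partner-prediction method can exploit.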